Skip to content

build(medcat and medcat-den): CU-869ddh1jv Avoid test resources in releases#503

Merged
mart-r merged 25 commits into
mainfrom
build/medcat-and-den/CU-869ddh1jv-avoid-test-resources-in-releases
May 27, 2026
Merged

build(medcat and medcat-den): CU-869ddh1jv Avoid test resources in releases#503
mart-r merged 25 commits into
mainfrom
build/medcat-and-den/CU-869ddh1jv-avoid-test-resources-in-releases

Conversation

@mart-r
Copy link
Copy Markdown
Collaborator

@mart-r mart-r commented May 22, 2026

The underlying issue

medcat-den source distribution are pushed to TestPyPI on every commit. And because they include test-time resources (test / fake models) they are rather large (~32MB). Over time this has meant we've reached PyPI's per project storage limit of 10GB. So now, because of this, medcat-den workflows on the main branch are failing because TestPyPI uploads are failing.

Caveats to consider

The idea of packaging your tests (along with the resources required to run them) is quite common for source distributions. In fact, the default behaviour seems to be to include everything that is tracked by git. There are a number of ways to get around this (i.e removing the files before building, pruning in MANIFEST.in), but they seem to be counter to the open source principles or not really following modern package building standards.

The proposed plan

In order to make this a viable option, I plan to store test time models centrally to the repo. This means that they won't be included in the builds since they're outside the scope of the source. But it also has the added benefit of allowing us to reused the same test models across multiple projects within the repo (e.g medcat and medcat-den, but why not medcat-service as well). On top of that there needs to be a way to access these files from a source distribution. And because that now doesn't include these test-time resources, they need to be fetached. The plan uses pooch to do the fetching from the relevant version on GitHub, but the logic defaults to local files if available. This will involve including these files in relevant releases as well. On the way there we also need to make some changes on the exact paths that are used to interact with these models in the test suite (but that shouldn't be extensive).

This is the plan:

  • Store test time models centrally
  • Implement for MedCAT
  • Implement for MedCAT-den
  • Implement workflow to add resources to relevant releases

@mart-r mart-r marked this pull request as draft May 22, 2026 11:53
@mart-r mart-r marked this pull request as ready for review May 22, 2026 15:18
Copy link
Copy Markdown
Collaborator

@alhendrickson alhendrickson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, looks a lot better than before. Will be keen to get medcat-service in on this as well, with the complication that it also wants the models in the docker image.

Pooch looks great as well. I've got a few places (trainer, medcat service k8s) where I need to download assets on startup and might look to use this instead.

v2_model = "mct2_model_pack.zip"


def _get_version(project_name: str = 'medcat') -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a dumber option as it's all hardcoded anyway - could we just pin an exact stable version instead of using _get_version?

I'd probably rather always pull cogstack-nlp/releases/download/3.11/my_zip reliably, instead of being a dynamic URL to debug later

Right now it kind of looks like the medcat version will change the test model used, but I'd rather make it a deliberate change. I really dont see the test models changing any time soon

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you're saying. And I agree that these files are unlikely to change frequently.

But I do think it's realistic that they may be added to at some point.

Hard-coding the version here would then mean we'd need to bump the version every time we do a release.
Realistically, we're never really running the tests in the state that this solution is designed for (i.e an install from source distribution that runs the tests).

with:
packages_dir: medcat-den/dist

# test-time models for download
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along with the pin version comment -

Potentially we could do a release/tag for "medcat-test-models" separately, and test models from the medcat/medcat-den versions/releases? (Basically what you did with the release bundle idea).

With the assumption that the test models basically never change. Main advantage is it would probably be easier to reason about "medcat 3 is failing test X, but I know the test model zip is still version 1 and that exact file hasn't changed for a year", but I do see the advantage of versioning it all together as well, I'm sure you've already thought about this one! Definitely hard if the zip in source code changes but the released one doesnt...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we end up using this for more different parts it may make sense to centralise the entire logic into a installable package.

I think the main thing I want to avoid is yet another release cadence. As much as it's unlikely we'll need to bump this often, having these not tied to the releases of the specific tools that use them would (at least in my eyes) make it harder to track what version of which tool should work with what version of these test time models.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool - yeah that's fair for sure that it gets harder to track. I'm all good with this as you have it!

@mart-r mart-r merged commit 8f97c76 into main May 27, 2026
27 checks passed
@mart-r mart-r deleted the build/medcat-and-den/CU-869ddh1jv-avoid-test-resources-in-releases branch May 27, 2026 11:42
mart-r added a commit that referenced this pull request May 27, 2026
* fix(medcat): CU-869ddh1jv: Fix lock file issue.

Issue was introduced in #503. Looks like because I didn't update uv.lock the entire dependency resolution was redone and broke.
This PR should - hopefully - fix it.

* CU-869ddh1jv: Disallow latest typer version

* CU-869ddh1jv: Disallow all recent typer versions

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants